Linear Regression & Friends (OLS, Ridge, Lasso, Elastic Net)#
Linear regression is the hello world of predictive modeling: it’s simple enough to understand end-to-end, but rich enough to teach you the habits you’ll reuse everywhere:
how to translate a question into a loss function
how to solve an optimization problem (closed-form vs iterative)
how to diagnose overfitting
how regularization changes a model’s behavior
This notebook builds intuition first (with analogies), then stays honest to the math.
Learning goals#
By the end you should be able to:
fit simple and multiple linear regression
compute coefficients (“betas”) via:
closed-form normal equation
Cholesky factorization (linear algebra)
gradient descent (optimization)
understand and fit Ridge (L2), Lasso (L1) and Elastic Net (L1+L2)
use scikit-learn versions and interpret the key parameters
Notation (quick)#
Targets: \(y \in \mathbb{R}^n\)
Features: \(X \in \mathbb{R}^{n\times d}\) (each row is a sample)
Coefficients: \(\beta \in \mathbb{R}^d\)
Intercept: \(b \in \mathbb{R}\) (sometimes written \(\beta_0\))
Prediction: \(\hat{y} = b + X\beta\)
Table of contents#
The story: one dial vs many dials
Simple linear regression (one feature)
Multiple linear regression (many features)
Getting betas
closed-form / normal equation
Cholesky solve
gradient descent
Regularization
Ridge (L2)
Lasso (L1)
Elastic Net (L1 + L2)
scikit-learn equivalents + parameter intuition
Practical checklist + pitfalls
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) The story: one dial vs many dials#
Imagine you’re predicting someone’s commute time.
With one feature (say, distance), linear regression is like a single dial: turn the slope up/down until the line fits.
With many features (distance, traffic, weather, time of day), it becomes a mixing board with many sliders.
OLS (ordinary least squares) says:
“Pick the dial settings (betas) that make the squared mistakes as small as possible.”
That’s it. Everything else in this notebook is different ways to do that—and to prevent the model from getting too clever.
2) Simple linear regression (one feature)#
Model#
With one feature \(x\), we predict:
\[\hat{y} = \beta_0 + \beta_1 x\]
\(\beta_0\) is the intercept (where the line crosses the y-axis)
\(\beta_1\) is the slope (how much \(y\) changes when \(x\) increases by 1)
Loss (OLS)#
We choose betas to minimize the mean squared error (MSE):
\[\mathrm{MSE}(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2\]
# Synthetic "one-feature" dataset
n_samples_simple = 120
x_simple = rng.uniform(0, 10, size=n_samples_simple)
beta0_true = 3.0
beta1_true = 2.0
noise = rng.normal(0, 2.0, size=n_samples_simple)
y_simple = beta0_true + beta1_true * x_simple + noise
fig = px.scatter(
x=x_simple,
y=y_simple,
title="Synthetic data: one feature",
labels={"x": "x", "y": "y"},
)
fig.show()
Closed-form for the one-feature case#
When there’s only one feature, OLS has a compact solution.
Let \(\bar{x}\) and \(\bar{y}\) be the means. Then:
\[\hat{\beta}_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}\]
Interpretation:
numerator = how \(x\) and \(y\) “move together” (covariance)
denominator = how much \(x\) varies
If \(x\) doesn’t vary at all, you can’t learn a slope.
# Closed-form simple regression (one feature)
x_mean = x_simple.mean()
y_mean = y_simple.mean()
beta1_hat = np.sum((x_simple - x_mean) * (y_simple - y_mean)) / np.sum((x_simple - x_mean) ** 2)
beta0_hat = y_mean - beta1_hat * x_mean
beta0_hat, beta1_hat
(2.996017195166255, 1.9971905099769254)
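As a quick sanity check (this cell is an added sketch, not part of the original notebook), sklearn’s LinearRegression should recover essentially the same intercept and slope as the closed-form formulas above:
# Cross-check the closed-form betas against sklearn's LinearRegression
lin_check = LinearRegression().fit(x_simple.reshape(-1, 1), y_simple)
lin_check.intercept_, lin_check.coef_[0]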
# Plot the fitted line
x_line = np.linspace(x_simple.min(), x_simple.max(), 200)
y_line = beta0_hat + beta1_hat * x_line
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_simple, y=y_simple, mode="markers", name="data"))
fig.add_trace(go.Scatter(x=x_line, y=y_line, mode="lines", name="OLS fit"))
fig.update_layout(
title="Simple linear regression: OLS line",
xaxis_title="x",
yaxis_title="y",
)
fig.show()
Residuals: “what the story doesn’t explain”#
Residuals are \(r_i = y_i - \hat{y}_i\).
A friendly mental model: if the model is a story about the data, residuals are the parts the story can’t explain.
y_pred_simple = beta0_hat + beta1_hat * x_simple
residuals = y_simple - y_pred_simple
fig = px.histogram(residuals, nbins=30, title="Residual distribution (simple regression)")
fig.update_layout(xaxis_title="residual", yaxis_title="count")
fig.show()
3) Multiple linear regression (many features)#
With multiple features, we write predictions compactly.
Create a design matrix with an intercept column:
\[\tilde{X} = \begin{bmatrix}\mathbf{1} & X\end{bmatrix} \in \mathbb{R}^{n \times (d+1)}\]
Then:
\[\hat{y} = \tilde{X}\beta, \qquad \beta \in \mathbb{R}^{d+1}\]
Same idea, more dials.
Why multiple regression gets tricky (multicollinearity)#
If two features carry almost the same information (e.g., x2 ≈ x1), OLS can still fit well, but the individual coefficients can become unstable.
Analogy: two coworkers both trying to take credit for the same project.
We’ll build a dataset with correlated features to make this visible.
# Synthetic multi-feature dataset with correlated features
n_samples_multi = 300
x1 = rng.normal(0, 1, size=n_samples_multi)
# x2 is highly correlated with x1
x2 = 0.85 * x1 + rng.normal(0, 0.25, size=n_samples_multi)
# x3 is mostly independent
x3 = rng.normal(0, 1, size=n_samples_multi)
X_raw = np.column_stack([x1, x2, x3])
feature_names = ["x1 (signal)", "x2 (correlated)", "x3 (signal)"]
true_intercept = 1.5
true_beta = np.array([2.0, -1.5, 0.7])
y_multi = true_intercept + X_raw @ true_beta + rng.normal(0, 1.0, size=n_samples_multi)
corr = np.corrcoef(X_raw, rowvar=False)
fig = px.imshow(corr, x=feature_names, y=feature_names, text_auto=True, title="Feature correlation")
fig.show()
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_multi, test_size=0.25, random_state=42)
X_train.shape, X_test.shape
((225, 3), (75, 3))
4) Getting betas (three ways)#
4.1 Closed-form (normal equation)#
OLS minimizes:
\[J(\beta) = \lVert y - X\beta \rVert_2^2\]
Take derivatives, set to zero → the normal equation:
\[X^\top X\,\beta = X^\top y\]
If \(X^\top X\) is invertible:
\[\hat{\beta} = (X^\top X)^{-1} X^\top y\]
In practice: avoid explicit inverses. Use solve, lstsq, or factorizations.
def add_intercept_column(X: np.ndarray) -> np.ndarray:
return np.column_stack([np.ones(X.shape[0]), X])
def ols_via_lstsq(X: np.ndarray, y: np.ndarray) -> np.ndarray:
# Solves min ||Xb - y||_2 using a stable method (SVD under the hood)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
return beta
def ols_via_normal_equation_solve(X: np.ndarray, y: np.ndarray) -> np.ndarray:
# Solves (X^T X) b = X^T y without forming an explicit inverse
XtX = X.T @ X
Xty = X.T @ y
return np.linalg.solve(XtX, Xty)
def ols_via_cholesky(X: np.ndarray, y: np.ndarray) -> np.ndarray:
# If XtX is SPD, we can solve using a Cholesky factorization: XtX = L L^T
XtX = X.T @ X
Xty = X.T @ y
L = np.linalg.cholesky(XtX)
z = np.linalg.solve(L, Xty)
beta = np.linalg.solve(L.T, z)
return beta
X_design = add_intercept_column(X_train)
beta_lstsq = ols_via_lstsq(X_design, y_train)
beta_solve = ols_via_normal_equation_solve(X_design, y_train)
beta_chol = ols_via_cholesky(X_design, y_train)
beta_lstsq, beta_solve, beta_chol
(array([ 1.5167, 2.2547, -1.8184, 0.674 ]),
array([ 1.5167, 2.2547, -1.8184, 0.674 ]),
array([ 1.5167, 2.2547, -1.8184, 0.674 ]))
# Compare to the true parameters
beta_true = np.concatenate([[true_intercept], true_beta])
labels = ["intercept", "x1", "x2", "x3"]
fig = go.Figure()
fig.add_trace(go.Bar(name="true", x=labels, y=beta_true))
fig.add_trace(go.Bar(name="OLS (lstsq)", x=labels, y=beta_lstsq))
fig.add_trace(go.Bar(name="OLS (solve)", x=labels, y=beta_solve))
fig.add_trace(go.Bar(name="OLS (cholesky)", x=labels, y=beta_chol))
fig.update_layout(barmode="group", title="Coefficients: true vs OLS solutions")
fig.show()
4.2 Gradient descent (optimization)#
Closed-form is great, but:
it can be expensive when \(d\) is large
it doesn’t exist for Lasso
it’s not how many modern models are trained
Gradient descent treats the loss like a landscape and repeatedly takes steps “downhill”.
For MSE with an intercept column included, a common gradient form is:
\[\nabla_{\beta}\,\mathrm{MSE}(\beta) = \frac{2}{n}\,X^\top(X\beta - y)\]
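The demo cell below calls a helper named gradient_descent_linear_regression that is not defined anywhere above. Here is a minimal full-batch sketch that matches the call signature used in that cell (a reconstruction under that assumption, not the original implementation):
def gradient_descent_linear_regression(
    X: np.ndarray,
    y: np.ndarray,
    learning_rate: float = 0.01,
    n_steps: int = 1000,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    # Full-batch gradient descent on the MSE; X is assumed to include an intercept column.
    n_samples = X.shape[0]
    beta = np.zeros(X.shape[1])
    steps, losses, betas = [], [], []
    for step in range(n_steps):
        residual = X @ beta - y
        steps.append(step)
        losses.append(np.mean(residual ** 2))
        betas.append(beta.copy())
        gradient = (2.0 / n_samples) * (X.T @ residual)  # (2/n) X^T (X beta - y)
        beta = beta - learning_rate * gradient
    return beta, np.array(steps), np.array(losses), np.array(betas)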
# Demonstrate on the simple 1D dataset
X_simple = add_intercept_column(x_simple.reshape(-1, 1))
beta_gd, gd_steps, gd_losses, gd_betas = gradient_descent_linear_regression(
X_simple,
y_simple,
learning_rate=0.01,
n_steps=3000,
)
beta_gd
fig = px.line(x=gd_steps, y=gd_losses, title="Gradient descent: MSE over iterations")
fig.update_layout(xaxis_title="iteration", yaxis_title="MSE")
fig.show()
# Visualize the loss surface J(beta0, beta1) and the GD path
beta0_grid = np.linspace(beta0_hat - 6, beta0_hat + 6, 80)
beta1_grid = np.linspace(beta1_hat - 2.5, beta1_hat + 2.5, 80)
B0, B1 = np.meshgrid(beta0_grid, beta1_grid)
# Compute MSE on the grid
Y_pred_grid = B0[..., None] + B1[..., None] * x_simple[None, None, :]
MSE_grid = np.mean((Y_pred_grid - y_simple[None, None, :]) ** 2, axis=-1)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=beta0_grid,
y=beta1_grid,
z=MSE_grid,
contours_coloring="heatmap",
showscale=True,
name="MSE",
)
)
fig.add_trace(
go.Scatter(
x=gd_betas[:, 0],
y=gd_betas[:, 1],
mode="lines+markers",
name="GD path",
line=dict(color="black"),
)
)
fig.update_layout(
title="Loss landscape (contours) + gradient descent path",
xaxis_title="beta0",
yaxis_title="beta1",
)
fig.show()
5) Regularization: Ridge, Lasso, Elastic Net#
When features are correlated or numerous, OLS can “spread credit” in unstable ways.
Regularization adds a preference:
Ridge (L2): “Prefer smaller coefficients.” (shrinks smoothly)
Lasso (L1): “Prefer fewer non-zero coefficients.” (can set some to zero)
Elastic Net: “A mix of both.”
Analogy:
Ridge is like a safety belt: you can still move, but extreme swings are damped.
Lasso is like a budget: if you pay for one coefficient, you have less to spend on others.
Important: regularization depends on feature scale → standardize your features.
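To see why scale matters, here is a small added sketch (not one of the original cells): refit Ridge after expressing x1 in different units. If the L2 penalty were scale-invariant, the rescaled coefficient, converted back to the original units, would match the original one; it does not.
# Hypothetical demo: the same data, with x1 rescaled by a factor of 100 (a pure change of units)
X_rescaled = X_train.copy()
X_rescaled[:, 0] = X_rescaled[:, 0] * 100.0
ridge_raw = Ridge(alpha=10.0).fit(X_train, y_train)
ridge_rescaled = Ridge(alpha=10.0).fit(X_rescaled, y_train)
# Convert the rescaled coefficient back to the original units before comparing
ridge_raw.coef_[0], ridge_rescaled.coef_[0] * 100.0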
Standardization helpers (from scratch)#
We’ll standardize using train statistics:
\[X_{\text{scaled}} = \frac{X - \mu_{\text{train}}}{\sigma_{\text{train}}}\]
Then we fit on scaled features and convert back to original units.
def standardize_fit(X_train: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
mean = X_train.mean(axis=0)
std = X_train.std(axis=0, ddof=0)
std = np.where(std == 0, 1.0, std)
return mean, std, (X_train - mean) / std
def standardize_apply(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
return (X - mean) / std
def unscale_coefficients(
beta_scaled: np.ndarray,
feature_mean: np.ndarray,
feature_std: np.ndarray,
target_mean: float,
) -> tuple[float, np.ndarray]:
# If y_centered = y - target_mean and X_scaled = (X - mean)/std
# then y = target_mean + X_scaled @ beta_scaled
# and beta_original = beta_scaled / std
beta_original = beta_scaled / feature_std
intercept_original = target_mean - feature_mean @ beta_original
return float(intercept_original), beta_original
def predict_linear(X: np.ndarray, intercept: float, beta: np.ndarray) -> np.ndarray:
return intercept + X @ beta
feature_mean, feature_std, X_train_scaled = standardize_fit(X_train)
y_train_mean = y_train.mean()
y_train_centered = y_train - y_train_mean
X_test_scaled = standardize_apply(X_test, feature_mean, feature_std)
X_train_scaled.shape
(225, 3)
5.1 Ridge regression (L2)#
Ridge regression (as implemented in sklearn.linear_model.Ridge) minimizes:
\[J(\beta) = \lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_2^2\]
\(\alpha \ge 0\) controls the strength (bigger → more shrinkage)
\(\alpha = 0\) reduces to OLS
The solution is still linear algebra, just with a “stabilized” matrix:
\[\hat{\beta}_{\text{ridge}} = (X^\top X + \alpha I)^{-1} X^\top y\]
Practical note:
We typically do not penalize the intercept.
In this notebook we center \(y\) and standardize \(X\), then recover the intercept from the training mean.
def ridge_closed_form(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
n_features = X.shape[1]
XtX = X.T @ X
Xty = X.T @ y
return np.linalg.solve(XtX + alpha * np.eye(n_features), Xty)
alphas = np.logspace(-3, 2, 40)
ridge_coefs_scaled = []
for alpha in alphas:
ridge_coefs_scaled.append(ridge_closed_form(X_train_scaled, y_train_centered, alpha))
ridge_coefs_scaled = np.vstack(ridge_coefs_scaled)
fig = go.Figure()
for j, name in enumerate(feature_names):
fig.add_trace(go.Scatter(x=alphas, y=ridge_coefs_scaled[:, j], mode="lines", name=name))
fig.update_layout(
title="Ridge coefficient paths (scaled features)",
xaxis_title="alpha (log scale)",
yaxis_title="coefficient (scaled)",
xaxis_type="log",
)
fig.show()
# Evaluate ridge for a few alpha values (convert back to original units)
def fit_ridge_and_score(alpha: float) -> dict:
beta_scaled = ridge_closed_form(X_train_scaled, y_train_centered, alpha)
intercept, beta = unscale_coefficients(beta_scaled, feature_mean, feature_std, y_train_mean)
y_pred = predict_linear(X_test, intercept, beta)
return {
"alpha": alpha,
"mse": mean_squared_error(y_test, y_pred),
"r2": r2_score(y_test, y_pred),
"intercept": intercept,
"beta": beta,
}
[fit_ridge_and_score(a) for a in [0.0, 0.1, 1.0, 10.0]]
[{'alpha': 0.0,
'mse': 1.221855599191807,
'r2': 0.5613497744532487,
'intercept': 1.5167030598770732,
'beta': array([ 2.2547, -1.8184, 0.674 ])},
{'alpha': 0.1,
'mse': 1.2206829487952962,
'r2': 0.5617707598473143,
'intercept': 1.5172826696255273,
'beta': array([ 2.2349, -1.7957, 0.6741])},
{'alpha': 1.0,
'mse': 1.2137240283725328,
'r2': 0.564269035433256,
'intercept': 1.5220139241124422,
'beta': array([ 2.073 , -1.6111, 0.6742])},
{'alpha': 10.0,
'mse': 1.2517741006491223,
'r2': 0.5506089328833015,
'intercept': 1.5458280302704095,
'beta': array([ 1.2685, -0.7048, 0.6596])}]
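As a cross-check (an added sketch, not an original cell), sklearn’s Ridge inside a StandardScaler pipeline should land on essentially the same scaled coefficients as the closed-form solver:
# Compare the from-scratch ridge solution with sklearn's Ridge (alpha=1.0)
ridge_sklearn = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)
ridge_sklearn.named_steps["ridge"].coef_, ridge_closed_form(X_train_scaled, y_train_centered, alpha=1.0)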
Ridge via Cholesky (linear algebra)#
Ridge requires solving a symmetric positive definite system (for \(\alpha > 0\)):
\[(X^\top X + \alpha I)\,\beta = X^\top y\]
A classic approach is Cholesky factorization:
factor \(A = X^\top X + \alpha I\) as \(A = LL^\top\)
solve \(Lz = X^\top y\)
solve \(L^\top\beta = z\)
This avoids computing any inverse explicitly.
def ridge_via_cholesky(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
n_features = X.shape[1]
A = (X.T @ X) + alpha * np.eye(n_features)
b = X.T @ y
L = np.linalg.cholesky(A)
z = np.linalg.solve(L, b)
return np.linalg.solve(L.T, z)
alpha_demo = 1.0
beta_solve = ridge_closed_form(X_train_scaled, y_train_centered, alpha=alpha_demo)
beta_chol = ridge_via_cholesky(X_train_scaled, y_train_centered, alpha=alpha_demo)
np.max(np.abs(beta_solve - beta_chol))
2.220446049250313e-15
Choosing \(\alpha\): a quick validation curve#
In real projects you tune \(\alpha\) using a validation set or cross-validation.
Below is a simple validation curve for Ridge (using an sklearn pipeline so scaling is fit only on the training split).
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=7)
alphas_ridge = np.logspace(-3, 3, 60)
val_mse = []
for a in alphas_ridge:
model = make_pipeline(StandardScaler(), Ridge(alpha=a))
model.fit(X_tr, y_tr)
val_mse.append(mean_squared_error(y_val, model.predict(X_val)))
val_mse = np.array(val_mse)
best_alpha = float(alphas_ridge[np.argmin(val_mse)])
fig = px.line(x=alphas_ridge, y=val_mse, title=f"Ridge validation curve (best alpha ≈ {best_alpha:.4g})")
fig.update_layout(xaxis_title="alpha (log scale)", yaxis_title="validation MSE", xaxis_type="log")
fig.show()
best_alpha
0.001
5.2 Lasso regression (L1)#
Lasso replaces the L2 penalty with L1:
\[J(\beta) = \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1\]
Key effect: L1 tends to create sparsity (some coefficients become exactly 0).
There is no closed-form solution in general.
A common solver is coordinate descent:
hold all but one coefficient fixed
update that one coefficient, then move to the next
repeat until convergence
The core operation becomes soft-thresholding:
\[S(z, \gamma) = \operatorname{sign}(z)\,\max(\lvert z\rvert - \gamma,\; 0)\]
def soft_threshold(z: float, gamma: float) -> float:
if z > gamma:
return z - gamma
if z < -gamma:
return z + gamma
return 0.0
def lasso_coordinate_descent(
X: np.ndarray,
y: np.ndarray,
alpha: float,
n_iter: int = 5000,
tol: float = 1e-7,
warm_start: np.ndarray | None = None,
) -> np.ndarray:
# Objective used here:
# (1/(2n)) ||y - Xb||^2 + alpha ||b||_1
# Assumes y is centered and columns of X are standardized.
n_samples, n_features = X.shape
beta = np.zeros(n_features) if warm_start is None else warm_start.copy()
residuals = y - X @ beta
feature_norms = np.mean(X ** 2, axis=0)
for _ in range(n_iter):
beta_prev = beta.copy()
for j in range(n_features):
residuals = residuals + X[:, j] * beta[j]
rho = np.mean(X[:, j] * residuals)
beta[j] = soft_threshold(rho, alpha) / feature_norms[j]
residuals = residuals - X[:, j] * beta[j]
if np.max(np.abs(beta - beta_prev)) < tol:
break
return beta
alphas = np.logspace(-3, 0.8, 45)
lasso_coefs_scaled = []
beta_ws = None
for alpha in alphas:
beta_ws = lasso_coordinate_descent(X_train_scaled, y_train_centered, alpha=alpha, warm_start=beta_ws)
lasso_coefs_scaled.append(beta_ws.copy())
lasso_coefs_scaled = np.vstack(lasso_coefs_scaled)
fig = go.Figure()
for j, name in enumerate(feature_names):
fig.add_trace(go.Scatter(x=alphas, y=lasso_coefs_scaled[:, j], mode="lines", name=name))
fig.update_layout(
title="Lasso coefficient paths (scaled features)",
xaxis_title="alpha (log scale)",
yaxis_title="coefficient (scaled)",
xaxis_type="log",
)
fig.show()
# Evaluate lasso for a few alpha values
def fit_lasso_and_score(alpha: float) -> dict:
beta_scaled = lasso_coordinate_descent(X_train_scaled, y_train_centered, alpha=alpha)
intercept, beta = unscale_coefficients(beta_scaled, feature_mean, feature_std, y_train_mean)
y_pred = predict_linear(X_test, intercept, beta)
return {
"alpha": alpha,
"mse": mean_squared_error(y_test, y_pred),
"r2": r2_score(y_test, y_pred),
"beta": beta,
"num_nonzero": int(np.sum(np.abs(beta_scaled) > 1e-10)),
}
[fit_lasso_and_score(a) for a in [0.01, 0.05, 0.1, 0.2]]
[{'alpha': 0.01,
'mse': 1.2158339344013227,
'r2': 0.5635115721487047,
'beta': array([ 2.023 , -1.5535, 0.6674]),
'num_nonzero': 3},
{'alpha': 0.05,
'mse': 1.27909288052184,
'r2': 0.5408014000121864,
'beta': array([ 1.0961, -0.494 , 0.6406]),
'num_nonzero': 3},
{'alpha': 0.1,
'mse': 1.3964369179099974,
'r2': 0.4986745001551752,
'beta': array([0.6242, 0. , 0.5845]),
'num_nonzero': 2},
{'alpha': 0.2,
'mse': 1.5759458587189812,
'r2': 0.4342301931310043,
'beta': array([0.4975, 0. , 0.4452]),
'num_nonzero': 2}]
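Since sklearn’s Lasso optimizes the same (1/(2n))-scaled objective, fitting it directly on the scaled/centered arrays should closely match our coordinate-descent coefficients (an added sketch, not an original cell):
# Compare the from-scratch coordinate descent with sklearn's Lasso (alpha=0.1)
lasso_sklearn = Lasso(alpha=0.1, max_iter=50_000).fit(X_train_scaled, y_train_centered)
lasso_sklearn.coef_, lasso_coordinate_descent(X_train_scaled, y_train_centered, alpha=0.1)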
5.3 Elastic Net (L1 + L2)#
Elastic Net combines both penalties:
\[J(\beta) = \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha\left(\rho\,\lVert \beta \rVert_1 + \frac{1-\rho}{2}\,\lVert \beta \rVert_2^2\right)\]
\(\alpha\) controls overall regularization
\(\rho \in [0,1]\) (often called l1_ratio) controls the mix
\(\rho=1\) → Lasso
\(\rho=0\) → Ridge
Elastic Net is especially useful when features are correlated: it can select groups of features instead of arbitrarily picking one.
def elastic_net_coordinate_descent(
X: np.ndarray,
y: np.ndarray,
alpha: float,
l1_ratio: float,
n_iter: int = 5000,
tol: float = 1e-7,
warm_start: np.ndarray | None = None,
) -> np.ndarray:
# Objective used here:
# (1/(2n)) ||y - Xb||^2 + alpha * (l1_ratio ||b||_1 + (1-l1_ratio)/2 ||b||_2^2)
n_samples, n_features = X.shape
beta = np.zeros(n_features) if warm_start is None else warm_start.copy()
residuals = y - X @ beta
feature_norms = np.mean(X ** 2, axis=0)
l1 = alpha * l1_ratio
l2 = alpha * (1.0 - l1_ratio)
for _ in range(n_iter):
beta_prev = beta.copy()
for j in range(n_features):
residuals = residuals + X[:, j] * beta[j]
rho = np.mean(X[:, j] * residuals)
beta[j] = soft_threshold(rho, l1) / (feature_norms[j] + l2)
residuals = residuals - X[:, j] * beta[j]
if np.max(np.abs(beta - beta_prev)) < tol:
break
return beta
alphas = np.logspace(-3, 0.8, 45)
l1_ratio = 0.5
enet_coefs_scaled = []
beta_ws = None
for alpha in alphas:
beta_ws = elastic_net_coordinate_descent(
X_train_scaled, y_train_centered, alpha=alpha, l1_ratio=l1_ratio, warm_start=beta_ws
)
enet_coefs_scaled.append(beta_ws.copy())
enet_coefs_scaled = np.vstack(enet_coefs_scaled)
fig = go.Figure()
for j, name in enumerate(feature_names):
fig.add_trace(go.Scatter(x=alphas, y=enet_coefs_scaled[:, j], mode="lines", name=name))
fig.update_layout(
title=f"Elastic Net coefficient paths (l1_ratio={l1_ratio})",
xaxis_title="alpha (log scale)",
yaxis_title="coefficient (scaled)",
xaxis_type="log",
)
fig.show()
6) scikit-learn equivalents + parameter intuition#
The sklearn versions are production-grade implementations with good defaults.
A few important notes:
Most sklearn linear models expect raw features and handle the intercept internally (fit_intercept=True by default).
For Ridge/Lasso/ElasticNet, feature scaling is critical → use StandardScaler().
LinearRegression solves OLS using a stable decomposition (no gradient descent).
Below we compare OLS/Ridge/Lasso/ElasticNet with a consistent pipeline.
models = {
"OLS": make_pipeline(StandardScaler(), LinearRegression()),
"Ridge(alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
"Lasso(alpha=0.1)": make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=50_000)),
"ElasticNet(alpha=0.1,l1_ratio=0.5)": make_pipeline(
StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=50_000)
),
"SGDRegressor (GD)": make_pipeline(
StandardScaler(),
SGDRegressor(
loss="squared_error",
penalty="l2",
alpha=1e-4,
learning_rate="invscaling",
eta0=0.01,
max_iter=5000,
tol=1e-6,
random_state=42,
),
),
}
rows = []
for name, model in models.items():
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rows.append(
{
"model": name,
"mse": mean_squared_error(y_test, y_pred),
"r2": r2_score(y_test, y_pred),
}
)
rows
[{'model': 'OLS', 'mse': 1.2218555991918068, 'r2': 0.5613497744532487},
{'model': 'Ridge(alpha=1.0)',
'mse': 1.2137240283725332,
'r2': 0.5642690354332558},
{'model': 'Lasso(alpha=0.1)',
'mse': 1.3964372563318266,
'r2': 0.4986743786606125},
{'model': 'ElasticNet(alpha=0.1,l1_ratio=0.5)',
'mse': 1.365775969037568,
'r2': 0.5096818828174675},
{'model': 'SGDRegressor (GD)',
'mse': 1.2112856232633256,
'r2': 0.5651444309806803}]
Parameter cheat sheet (the ones you’ll actually touch)#
LinearRegression#
fit_intercept: include the intercept term
positive: force coefficients to be non-negative
Ridge#
alpha: L2 strength (bigger → more shrinkage)
solver: numerical method (usually leave as auto)
fit_intercept: intercept handling
Lasso#
alpha: L1 strength (bigger → more sparsity)
max_iter, tol: convergence controls
selection: cyclic vs random coordinate updates
ElasticNet#
alpha: overall regularization
l1_ratio: mix between L1 and L2
max_iter, tol: convergence controls
SGDRegressor#
loss="squared_error": linear regression losspenalty:"l2","l1","elasticnet"alpha: regularization strength (note: different meaning/scale vs Ridge/Lasso)learning_rate,eta0: step size schedule
7) Practical checklist + pitfalls#
Always split train/test (or use CV) before deciding on alpha.
For Ridge/Lasso/ElasticNet:
standardize features (StandardScaler)
don't leak test information into scaling: fit the scaler on the training split only (see the sketch after this list)
Don’t interpret Lasso sparsity as “the truth” without domain checks.
If features are strongly correlated:
Ridge often stabilizes coefficients
Elastic Net can select groups
Lasso may pick one arbitrarily
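One way to honor both points, tuning alpha with cross-validation while keeping the scaler inside the pipeline, is sketched below (GridSearchCV and the grid are illustrative choices, not part of the original cells):
# Scaling lives inside the pipeline, so each CV fold fits the scaler on its own training part only
from sklearn.model_selection import GridSearchCV
ridge_pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}
search = GridSearchCV(ridge_pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
search.best_params_, -search.best_score_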
When to use what (rough intuition)#
OLS: small/clean problems, inference-focused, or as a baseline
Ridge: lots of correlated features, prediction-focused stability
Lasso: you want a sparse model (feature selection)
Elastic Net: you want sparsity and correlated-feature friendliness
Exercises#
Increase the correlation between x1 and x2 and see how the OLS coefficients behave.
Plot test MSE vs alpha for Ridge/Lasso and pick the best.
Add irrelevant noisy features and compare OLS vs Ridge.
References#
ESL (Hastie, Tibshirani, Friedman): The Elements of Statistical Learning
ISLR (James, Witten, Hastie, Tibshirani): An Introduction to Statistical Learning
sklearn docs for Ridge, Lasso, ElasticNet